import besca as bc
import scanpy as sc
import pandas as pd
import pkg_resources
The datasets that are already annotated and should be used for training. If you only use one dataset please use list of one.
adata_trains = [sc.read(pkg_resources.resource_filename('besca', 'datasets/data/Smillie2019_processed.h5ad'))]
The dataset of interest that should be annotated.
adata_pred = sc.read(pkg_resources.resource_filename('besca', 'datasets/data/haber_processed.h5ad'))
adata_orig = sc.read(pkg_resources.resource_filename('besca', 'datasets/data/haber_processed.h5ad'))
Give your analysis a name.
analysis_name = 'auto_annot_Haber2017_with_Smillie2019_Cluster'
Specify column name of celltype annotation you want to train on.
celltype_train ='dblabel'
celltype_test = 'dblabel'
Choose a method:
method = 'logistic_regression'
Specify merge method if using multiple training datasets. Needs to be either scanorama or naive.
merge = 'scanorama'
Decide if you want to use the raw format or highly variable genes. Raw increases computational time and does not necessarily improve predictions.
use_raw = False
You can choose to only consider a subset of genes from a signature set.
genes_to_use = 'all'
# Select epithelial subset from Smillie2019 dataset
epithelial_subset = bc.subset_adata(adata_trains[0], adata_trains[0].obs.celltype_highlevel == 'Epi', raw=False)
adata_trains[0] = epithelial_subset
# Convert mouse symbols (MGI) to human symbols (HGNC)
mousehuman_file = pkg_resources.resource_filename('besca', 'datasets/homologs/MGItoHGNC.csv')
mousehuman=pd.read_csv(mousehuman_file,sep='\t',header='infer', encoding="unicode_escape")
mousehuman.index=mousehuman['MGI']
conversion=pd.Series(data=mousehuman['HGNC'], index=mousehuman.index)
# Convert mouse symbols (MGI) to human symbols (HGNC)
adata_orig.var.rename(columns={'SYMBOL':'MGI'}, inplace=True)
adata_orig.var['SYMBOL'] = adata_orig.var['MGI'].map(lambda x: conversion.get(x, default='') if type(conversion.get(x, default='')) == str else conversion.get(x, default=None).values[0])
adata_orig.var.index = adata_orig.var.SYMBOL
adata_orig.var_names_make_unique()
adata_pred = adata_orig.copy()
This function merges training datasets, removes unwanted genes, and if scanorama is used corrects for datasets.
adata_train, adata_pred = bc.tl.auto_annot.merge_data(adata_trains, adata_pred, genes_to_use = genes_to_use, merge = merge)
The returned scaler is fitted on the training dataset (to zero mean and scaled to unit variance).
classifier, scaler = bc.tl.auto_annot.fit(adata_train, method, celltype_train)
Use fitted model to predict celltypes in adata_pred. Prediction will be added in a new column called 'auto_annot'. Paths are needed as adata_pred will revert to its original state (all genes, no additional corrections). The threshold should be set to 0 or left out for SVM. For logisitic regression the threshold can be set.
adata_predicted = bc.tl.auto_annot.adata_predict(classifier = classifier, scaler = scaler, adata_pred = adata_pred, adata_orig = adata_orig, threshold = 0.7)
Write out metrics to a report file, create confusion matrices and comparative umap plots
adata_pred.obs
%matplotlib inline
bc.tl.auto_annot.report(adata_pred=adata_predicted, celltype=celltype_test, method=method, analysis_name=analysis_name,
train_datasets=adata_trains, test_dataset=adata_orig, merge=merge, use_raw=False,
genes_to_use=genes_to_use, remove_nonshared=True, clustering='leiden', asymmetric_matrix=True)
sc.pl.umap(adata_predicted, color=[celltype_test, 'auto_annot'])
sc.pl.umap(adata_predicted, color=[celltype_test, 'auto_annot'], legend_loc='on data', legend_fontsize=8)
sc.pl.umap(adata_predicted, color=[celltype_test, 'auto_annot'], legend_fontsize=7, wspace = 1.4, save = '.svg')
sc.pl.umap(adata_predicted, color=[celltype_test, 'auto_annot'], legend_loc='on data', legend_fontsize=7, wspace = 1.4, save = '.ondata.svg')
adata_train
from sinfo import sinfo
sinfo()